GERALDA: A framework for integration of genomic information into the database of the Brazilian Health System

The majority of Brazilians are of mixed race according to IBGE. Racial and genetic admixture shape demographic and health information, and it is in the public interest that genetic information of demographic groups be used to improve the health system (SUS) of Brazil in the context of databases that use omic data to construct artificial intelligence tools. Integration of genomic and ancestry information of Brazilians into public health data, such as the data provided by the Information Technology Department of SUS (DATASUS), can help estimate genetic risk of diseases and propose policies to improve diagnosis, allocation of resources, services, and therapies for population groups at higher genetic risk of diseases. Here we introduce GERALDA, a computational framework in the R language that uses mitochondrial variants to estimate the genetic risk of neuroblastoma and neurodegenerative diseases in Brazilians. This work shows an increased genetic risk of disease and an impact on cognition, morbidity, and mortality of Brazilians using mitochondrial DNA variants. This information can be used for the organization of the public health system, contributing to the rational use of its resources.

Gepoliano Chaves (University of Chicago), Pablo Ivan Ramos (CIDACS), Marilda Gonçalves (Instituto Gonçalo Moniz)
2024-10-04

1 Significance

Our group identified hypoxia as a triggering signal for the cellular transition from adrenergic (ADRN) to mesenchymal (MES) cells in neuroblastoma. We found that the transition is mediated by the deposition of the epigenetic marker 5-hydroxymethyl-cytosine (5-hmC) via the ten eleven translocation enzyme (TET1), a functional mechanism of hypomethylation. We are investigating hypomethylation patterns via 5-hmC in immortalized cells, tumor samples and cell-free DNA (cfDNA) isolated from peripheral blood liquid biopsies of neuroblastoma patients. Repetitive genomic elements, or repetitive regions of the genome, are part of the non-coding region and are important for the maintenance of the pluripotent cellular state of stem cells, which we call the de-differentiated state. Mesenchymal stem cells (MES) are dedifferentiated and their gene expression pattern shows a high correlation with the stem and pluripotent cell state.

We evaluated the deposition of the epigenetic marker 5-hmC in repetitive regions of the genomes of cells, tumors, and cfDNA of neuroblastoma patients for oncological management of patients. We aim to incorporate liquid biopsy with sequencing of 5-hmC in repetitive element markers into the routine of public precision medicine programs for neuroblastoma in Brazil. To this end, we propose the construction of a computational database that incorporates genomic and demographic data available in the health system, to allow better data analysis, machine learning, and artificial intelligence using omic marker data for management of nervous system diseases in the public health system, together with the Center for Data Integration and Knowledge for Health of the Gonçalo Moniz Institute at Fiocruz in Bahia.

2 Introduction

2.1 Cellular transition and reprogramming

Neuroblastoma is a pediatric cancer of the peripheral nervous system. Tumors are composed of two main cell lineages: adrenergic (ADRN) and mesenchymal (MES) cells. These cells can interconvert through enhancers and super-enhancers (van Groningen et al. 2017; Boeva et al. 2017), epigenetic markers in noncoding genome regions that were identified with machine learning methods: hierarchical clustering (van Groningen et al. 2017) and principal component analysis (Boeva et al. 2017). Despite the importance of the mechanism of ADRN-to-MES cell interconversion, the cellular signals that trigger the transition are not understood. ADRN cells are neuron-like and sensitive to chemotherapy (Figure 1, left), while MES cells are similar to undifferentiated or dedifferentiated stem cells (Figure 1, right) and are responsible for resistance to chemotherapy and immunotherapy (van Groningen et al. 2017; Kendsersky et al. 2022; Mabe et al. 2022).

Hypoxia is the condition of limited oxygen supply for tumor growth. Among extracellular signals that activate 5-hmC deposition via TET1, our group at the University of Chicago identified hypoxia as a signal that activates gene expression by 5-hmC deposition and methylation removal (Mariani et al., 2014). Our studies and those of other groups have shown that hypoxia drives dedifferentiation in neuroblastoma cells (Jögi et al. 2002; Mariani et al. 2014; Hains et al. 2022). We identified a functional epigenetic demethylation mechanism mediated by Ten Eleven Translocation (TET) family enzymes, and genes activated by hypoxia in the transition from the ADRN to MES cellular state (Chaves et al. 2024 - in preparation), (Figure 5). Our group’s results propose a functional mechanism of hypoxia-driven methylation removal and gene expression activation (Mariani et al., 2014; Hains et al., 2022; Chaves et al. 2024 - in preparation) (Figures 2 and 5). Our work suggests hypoxia as a mechanism for maintaining cellular dedifferentiation, consistent with the hypomethylation of repetitive transposable elements (Figure 5).

Most CpG islands are located in repetitive intergenic regions of the genome (Lanciano and Cristofari 2020) and, when methylated, serve to maintain an inactive state of DNA transcription (Fetahu and Taschner-Mandl 2021). Repetitive transposable DNA elements (TEs) are mobile genetic elements that make up a large fraction of the genome, reaching 15% in C. elegans and 85% in maize (Lanciano and Cristofari, 2020). These percentages show the potential of these elements as biomarkers (Lanciano and Cristofari, 2020). TEs have been considered “junk DNA” (Burns 2017; Ma et al. 2022) and their value in epigenetic modulation in neuroblastoma and as biomarkers is undetermined.

TEs are active in human embryonic stem cells and, through functional hypomethylation, participate in the activation of gene expression in a mechanism compatible with TET1 activity and 5-hmC deposition (Ma et al., 2022). Retrotransposons such as HERV-H demarcate topologically associated domains (TADs) in chromatin, which maintain the cellular pluripotency state (Zhang et al., 2019). These observations agree with the hypothesis of induction of a stem cell (dedifferentiated) state in hypoxia-activated ADRN cells, involving repetitive or transposable elements, which lead to the expression of MES genes (Figure 5). Important biomarkers can be identified among TE elements, since they represent a high fraction of the genome composition. Thus, the detection of repetitive elements of the genome in cfDNA isolated by liquid biopsy can help in the management of cancer patients (Figure 2).

Studies from our group have identified 5-hmC deposition patterns (Figure 3A) in tumors (Applebaum et al., 2019) and liquid biopsy cfDNA (Applebaum et al., 2020) using the nano 5-hmC-seal method developed by Dr. Chuan He from the University of Chicago (Figure 3B). We propose that the TET1 enzyme, activated by hypoxia, causes oxidation of methylated DNA sequences and formation of 5-hydroxymethyl-cytosine (5-hmC) hypomethylation regions (Figure 3A). Thus, in the present project, sequencing the 5-hmC marker in Brazil will aid in patient management, as done in the USA (Chennakesavalu, Moore, Chaves, et al. 2024). We postulate that it will be possible to quantify ADRN and MES phenotypes in tumors in Brazilians, as we have recently done (Vayani, Chaves et al. 2023).

2.2 Mitochondria in cellular dedifferentiation or reprogramming in the nervous system

Mitochondria are important organelles in cellular metabolism and the citric acid cycle and play a role in pluripotency and differentiation at the embryonic stage (Carey et al. 2015; Hensley, Wasti, and DeBerardinis 2013; Qing et al. 2012; Yoo et al. 2020). Mitochondrial DNA is small and maternally derived, with approximately 16,000 base pairs. Mitochondria provide the cellular supply of ATP for the nervous system, playing an important role in diseases of this system, such as neuroblastoma and neurodegenerative diseases. In neuroblastoma, John Maris’ group conducted case-control studies in Caucasians from the USA and observed that mitochondrial haplogroups (genetic variants) are associated with a reduced risk of the disease (Chang et al. 2020). The same group observed that a mitochondrial single nucleotide polymorphism (SNP), rs2853493, is associated with the risk of neuroblastoma and impacts the expression of the mitochondrial cytochrome B gene, MT-CYB (Chang et al. 2022).

In neurodegenerative diseases, Tranah et al. (2012) observed that, in elderly Caucasians, different mitochondrial haplogroups present an increased risk of developing dementia and cognitive decline. The group reported that in African-Americans, haplogroup L1 represents a greater risk of developing dementia when compared to haplogroup L3, which is more common in this ethnic-racial group. Also for haplogroup L3, the SNP p.V193I, a substitution in the ND2 gene, was associated with increased levels of amyloid plaque, a phenotype of Alzheimer’s disease (Tranah et al. 2014). Neuroblastoma arises during sympathetic neurogenesis. Neurogenesis and nervous system development pathways are suppressed in mitochondrial haplogroups in neuroblastoma, and this may be related to the mechanism underlying the reduced risk associated with the mitochondrial haplogroups investigated in Chang et al. (2020). Because of the importance of mitochondrial sequences for dementia and neuroblastoma, computational methods need to be developed to quantify the genetic risk associated with them, especially in populations other than US Whites, for which data are scarce. The choice of mitochondrial sequences considers an existing cross-talk between mitochondrial metabolism and activation of repetitive genomic elements (Baeken, Moosmann, and Hajieva 2020; Larsen et al. 2017; Stoccoro and Coppedè 2021; Lopes 2020; Bravo et al. 2020) that has implications for the role of mitochondrial haplogroups in activation of the immune system, as discussed by Chang et al. (2020).

To allow contrasting demographic data on nervous system diseases in the context of the public health system of Brazil, we propose exploring mortality data from neuroblastoma and neurodegenerative diseases in the public database of the Informatics Department of the Ministry of Health of Brazil (DATASUS). This will allow the development of computational tools capable of estimating genetic risk for different Brazilian racial groups, a long-needed approach for the health system of Brazil, based on the identification of mitochondrial variants that confer a higher risk of neuroblastoma and neurodegenerative diseases, using data curated from the literature (Tranah et al. 2012; Tranah et al. 2014; Chang et al. 2020; Chang et al. 2022).

2.3 Management of public health data

Management and decision-making processes in health systems occur with development of tools capable of processing and analyzing data generated by such systems, investigating the dynamics that affect mortality in various morbidities, such as those that affect the nervous system, which represent a significant proportion of Government expenditures. Recently, private interest of pharmaceutical companies leveraged the power of the biobank of the United Kingdom to develop machine learning and artificial intelligence tools to predict diseases and phenotypes using the information contained in the UK Biobank, as presented in the study carried out by Garg et al. (2024). In Brazil, the Federal Constitution of 1988 established the Unified Health System (in Portuguese, Sistema Único de Saúde - SUS), and after that, the SUS Department of Informatics (DATASUS) was created to organize the data collected by SUS (Saldanha, Rocha Bastos, and Barcellos, 2019). More recently, Programa Genomas Brasil was implemented with the goal to sequence 100,000 nationals to inform precision medicine policies in the public health system. Our group verified the persistence of racial inequality in the risk and survival of neuroblastoma (Chennakesavalu et al. 2023). Due to the genetic admixture present in Brazil since the beginning of colonization, a considerable part of the Brazilian population is of mixed race or black, which raises questions about the incidence, risk and treatment of neuroblastoma patients in Brazil considering their racial identification. Brazil is marked by profound socioeconomic inequalities related to the ethnic-racial origin of the population, which include but are not limited to digital literacy (Araújo da Silva and Behar 2019), the use of programming languages in genomic sciences (Sano et al. 2024; Vera-Choqqueccota et al. 2024), and racial inequality in health. 
The latter was recently identified by the Longitudinal Study of Adult Health (ELSA-Brazil), a study conducted with public workers in Brazilian public institutions (ELSA Brazil 2023). The creation of SUS-linked databases to conduct genomic research must therefore take into account the impact of Brazilian genetic admixture on both access to digital literacy and the health of Brazilians themselves. Considering the public resources required for genomic studies and the genomic literacy of researchers and scientists in Brazil, we propose the GERALDA framework as a concept to guide the integration of genomic information into the demographic database of SUS, as presented in the Materials and Methods section.

3 Materials and Methods

3.1 Health System Mortality Data

The R package microdatasus was used to access mortality data on neuroblastoma and neurodegenerative diseases as described by Freitas Saldanha et al. (2019).

3.2 Classification of Mitochondrial Haplogroups

Mitochondrial haplogroups were classified using Haplogrep 3 as described in Schönherr et al. (2023). The classification generates a csv file that can be used with other R packages to assess the genetic risk of neuroblastoma and neurodegenerative diseases associated with each haplogroup.
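A minimal sketch of how such a classification file could be read and annotated in base R is shown here. The inline text stands in for the exported file, the column names follow Table 2 of this document rather than the native Haplogrep 3 output, and the `risk_notes` lookup is a hypothetical simplification of the associations cited in the Introduction, not a validated risk model.

```r
# Inline stand-in for the classification file; rows are taken from Table 2.
csv_text <- "SampleID,Genotype,Origin,Region
AF243634,L1,African,Northeast
AF243635,L3,African,Northeast
AF243654,J,European,Northeast
AF243657,K1,European,Northeast
AF243652,H1,European,Northeast"

haplogroups <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Crude collapse to a macro-haplogroup: keep two characters for the L lineages
# (the text distinguishes L1 from L3) and one character otherwise (K1 -> K).
haplogroups$Macro <- ifelse(
  grepl("^L", haplogroups$Genotype),
  substr(haplogroups$Genotype, 1, 2),
  substr(haplogroups$Genotype, 1, 1)
)

# Hypothetical lookup summarising associations mentioned in this document.
risk_notes <- c(
  L1 = "dementia risk higher than L3 (Tranah et al. 2014)",
  L3 = "higher amyloid plaque levels (Tranah et al. 2014)",
  J  = "dementia risk, European ancestry (Tranah et al. 2012)",
  K  = "dementia risk (Tranah et al. 2012); reduced neuroblastoma risk (Chang et al. 2020)"
)

# Haplogroups absent from the lookup receive NA and are not flagged.
haplogroups$Note <- unname(risk_notes[haplogroups$Macro])
haplogroups$Flagged <- !is.na(haplogroups$Note)

print(haplogroups[, c("SampleID", "Genotype", "Macro", "Flagged")])
```

The one-letter collapse is only adequate for this small sample; a haplogroup such as HV would need its own rule.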

3.3 Estimation of Genetic Risk per Large Geographic Region

Risk was estimated and plotted using geobr as described by Pereira and Goncalves (2024). Geobr is a computational package to download official spatial data sets of Brazil. The package includes a wide range of geospatial data in geopackage format (like shapefiles), available at various geographic scales and for various years with harmonized attributes, projection and topology. This allows us to achieve a spatial-geographic organization of the data provided by the DATASUS department of the Ministry of Health.
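Because the haplogroup samples are reported per large region while health records are typically organized by state, a lookup from state to region is needed before the two can be combined. The sketch below builds that lookup in base R using the official IBGE grouping of the 27 federative units; the data frame and column names (`uf_region`, `region_freq`, `african_share`) are illustrative assumptions, and the regional shares are computed from the counts in Table 1.

```r
# Federative units (two-letter abbreviations) grouped by IBGE large region.
uf_by_region <- list(
  North          = c("AC", "AP", "AM", "PA", "RO", "RR", "TO"),
  Northeast      = c("AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE"),
  `Central-West` = c("DF", "GO", "MT", "MS"),
  Southeast      = c("ES", "MG", "RJ", "SP"),
  South          = c("PR", "RS", "SC")
)

# Long format: one row per state with its region.
uf_region <- stack(uf_by_region)
names(uf_region) <- c("uf", "region")
uf_region$region <- as.character(uf_region$region)

# Share of African matrilineal haplogroups per sampled region (Table 1 counts).
region_freq <- data.frame(
  region = c("Northeast", "South", "Southeast"),
  african_share = c(18 / 39, 0 / 17, 30 / 45)
)

# Left join: states in regions without samples keep NA.
uf_freq <- merge(uf_region, region_freq, by = "region", all.x = TRUE)
print(head(uf_freq))
```

A table like `uf_freq` could then be joined to state-level geometries or mortality counts by the state abbreviation.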

3.4 General Genomic Algorithmic Data (Geralda) Management

The adoption of genomic information and epigenetic markers, such as the 5-hmC marker in neuroblastoma, into the SUS system involves challenges, which necessarily include activities to support digital literacy and the use of computer programming technologies in genomics throughout the country. Thus, this project includes, in the methodological part, collaborative genomic research between CIDACS in Brazil and my doctoral and postdoctoral institutions in the USA, to mediate the teaching of computer programming languages for genomic research and the construction of the database for the application of machine learning to demographic data from the SUS. CIDACS proposed the harmonization of databases of social and health indicators, creating the Cohort of 100 million Brazilians, making important contributions to national health and epidemiology (Barreto et al. 2021). Integrating genomic information databases into the pipelines that allow investigations of the social and demographic data in Brazil can enrich and improve public policies of the national public health system of Brazil. This also has the potential to generate protocols for the use of machine learning and artificial intelligence in disease classification in the health system.

Figure 1: Framework of the GERALDA pipeline. Starting with fasta or fastq sequences, samples are aligned to the reference genome. Once VCF files are produced out of each sample, custom scripts are used to extract the genotypes of interest that will be used as features to inform the machine learning algorithm to classify discrete categories. Each haplotype thus identified is then used to label a racial group in the Microdatasus dataframe. This information is then used to estimate the genetic risk of each racial group in the DATASUS dataframe.
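The genotype extraction step of the pipeline can be sketched in base R as follows. This is only an illustration under stated assumptions: the VCF lines are a mock record built from the polymorphisms listed for sample AF243637 in Table 2, the marker list is arbitrary, and the custom scripts actually used by GERALDA are not shown in this document.

```r
# Mock single-sample mitochondrial VCF (hypothetical content).
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tAF243637",
  "chrM\t16223\t.\tC\tT\t.\tPASS\t.\tGT\t1",
  "chrM\t16327\t.\tC\tT\t.\tPASS\t.\tGT\t1"
)

# Drop the "##" meta lines; the "#CHROM" line becomes the header.
records <- vcf_lines[!startsWith(vcf_lines, "##")]

# Read every column as character so that an ALT allele "T" is not
# converted to the logical value TRUE.
vcf <- read.delim(text = records, check.names = FALSE,
                  colClasses = "character")

# Express variants in the position + allele notation used in Table 2.
observed <- paste0(vcf$POS, vcf$ALT)

# Binary feature vector over an illustrative set of markers.
markers <- c("16223T", "16327T", "16069T")
features <- setNames(as.integer(markers %in% observed), markers)
print(features)
```

One such vector per sample, stacked into a matrix, is the kind of input a classifier of discrete categories would take.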

4 Results and Discussion

4.1 Classification of Brazilian Mitochondrial Haplogroups

Race is not recognized by science as a valid system for classifying human groups. However, it is necessary to remind people and their governments of the historical record of scientific racism, social Darwinism and eugenic practices in both Brazil and the USA (Rambaran-Olm and Wade 2021). In Brazil, even racial self-classification has social and political consequences. Although it is acknowledged that the country has a historical genetic admixture involving the Portuguese and other European, African and Native American populations, Chor and Araujo Lima (2005) report a historical struggle in Brazil with racial inequality in access to health. Racial classification systems have been difficult to implement in Brazil because of the genetic admixture that started with the arrival of the first Portuguese. The difficulty of racial classification can be perceived in day-to-day life. The term “afroconveniência”, for example, which is difficult to translate into English, was created in Brazil to describe people too “light-skinned” to claim African ancestry according to Silva et al. (2023). Among the systems proposed for racial classification are ancestry, self-identification and genealogical records maintained by the government. In the USA, where the government has used racial identification for immigration, marriage and citizenship policies, the white identity was constructed around immigration from England and other Northern European countries. In Brazil, construction of the white identity followed a different pattern, based not on genealogical records but on the European phenotypes of the Southern European colonization, primarily of Portuguese but also, importantly, of Italian and Spanish ancestries. According to Chor and Araujo Lima (2005), IBGE adopted self-declaration for racial classification purposes.

Biological, genetic or genomic ancestries can inform the dynamics of ancient human population migration as well as the interaction of human populations with the environment (Tranah et al. 2012). The genomic ancestry investigated using mitochondrial sequences can also inform the genetic risk for diseases of the nervous system such as neurodegenerative diseases and neuroblastoma. Ethnic and religious groups such as Jews use mitochondrial sequences to estimate matrilineal ancestry (Feder et al. 2008), and we chose mitochondrial sequences because of the role of mitochondria in metabolic reprogramming and de-differentiation, which we identified in neuroblastoma to be associated with tumor progression (Chaves et al., 2024 - in preparation). To investigate the ancestry of the self-declared white population in Brazil, we used Haplogrep 3 to classify the matrilineal lineage of self-declared white Brazilians, aiming to quantify genetic risks for variants known to affect the nervous system (Table 1).


Table 1: Identification of mitochondrial haplogroups in self-declared white Brazilians, identified using Haplogrep 3 (Schönherr, Weissensteiner, and Kronenberg 2023). Samples present haplogroups J and K. These genotypes were related to the risk of dementia in the 2012 Tranah study in individuals of European ancestry. In individuals of African ancestry, haplogroup L1 is identified, which presents an increased risk of developing dementia, according to Tranah 2014. Also according to the 2014 Tranah study, the most common haplogroup among people of African ancestry, haplogroup L3, which is also observed in this sample from Brazil with 4 individuals (4 counts), presents higher levels of amyloid plaque deposition. This suggests that these individuals represent a risk group for the development of dementia among Brazilians.

                      Region
 Origin               Northeast   South   Southeast
   African                   18       0          30
   Amerindian/Asian           7       0          15
   European                  14      17           0
   #Total cases              39      17          45


Samples depicted in Table 1 derive from Haplogrep 3. They can be visualized as the number of sequences per continent of origin. The haplogroups.regions object is used to investigate the risk of disease in the nervous system per regions of Brazil, based on the genotypes of the mitochondrial DNA sequences for each of the large regions. To understand the risk of nervous system diseases we calculate the incidence of mitochondrial variants in the large regions. The code counts each of the haplogroups in Brazil.
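The counting step described above can be sketched in base R as follows. The `haplogroups.regions` object is rebuilt here from only the 18 African-origin Northeast rows of Table 2 (samples AF243634 to AF243651), so the output covers that subset rather than the full data set; the original code is not reproduced in this document.

```r
# Subset of Table 2: African-origin samples from the Northeast region.
haplogroups.regions <- data.frame(
  SampleID = sprintf("AF%d", 243634:243651),
  Genotype = c("L1", "L3", "M4", "L3", "L0", "L2", "L3", "L3", "L2",
               "L2", "L1", "L1", "L2", "L0", "L3", "L3", "L2", "L3"),
  Origin   = "African",
  Region   = "Northeast",
  stringsAsFactors = FALSE
)

# Cross-tabulate region against haplogroup to count each haplogroup.
haplogroup_counts <- table(Region = haplogroups.regions$Region,
                           Genotype = haplogroups.regions$Genotype)
print(haplogroup_counts)
```

With the full table loaded, the same `table()` call yields one row of counts per large region.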

This approach allows calculation of the incidence of each haplogroup and the frequency of the genetic variants in the large regions of Brazil. If N is the total number of samples in a region and n is the number of samples carrying a given haplogroup in that region, the frequency is n divided by N.
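As a worked instance of the frequency n / N, the sketch below applies it to the origin counts of Table 1, taking each region's column total as N. The object names are illustrative assumptions; the same operation applies to per-haplogroup counts.

```r
# Counts from Table 1: rows are continental origins, columns are regions.
origin_counts <- matrix(
  c(18,  0, 30,
     7,  0, 15,
    14, 17,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Origin = c("African", "Amerindian/Asian", "European"),
    Region = c("Northeast", "South", "Southeast")
  )
)

# N: total number of samples per region (column totals).
N <- colSums(origin_counts)

# Frequency n / N for every origin within each region.
freq <- sweep(origin_counts, 2, N, "/")

# Percentages rounded to one decimal place.
print(round(100 * freq, 1))
```

Each column of `freq` sums to 1, which is a quick check that the denominator matches the intended region.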

In this visualization, we include a SNP (single nucleotide polymorphism) in the last column:

Table 2: Mitochondrial Haplogroups per Region
SampleID Genotype Origin Region Found_Polys
AF243627 A2 Amerindian/Asian Northeast 152C! 16111T 16126C 16223T 16259T 16290T 16319A 16362C
AF243628 G1 Amerindian/Asian Northeast 16223T 16325C 16362C
AF243629 B4 Amerindian/Asian Northeast 16189C 16217C
AF243630 B2 Amerindian/Asian Northeast 16189C 16217C 16249C 16312G 16344T
AF243631 A+ Amerindian/Asian Northeast 16223T 16290T 16319A 16362C
AF243632 C1 Amerindian/Asian Northeast 16223T 16298C 16325C 16327T 16362C
AF243633 M7 Amerindian/Asian Northeast 16223T 16295T 16362C
AF243634 L1 African Northeast 1438G! 15301A! 16126C 16129A! 16187T 16189C 16223T 16264T 16270T 16278T 16293G 16311C
AF243635 L3 African Northeast 16176T 16223T 16327T
AF243636 M4 African Northeast 6131G! 16223T 16294T 16294T
AF243637 L3 African Northeast 16223T 16327T
AF243638 L0 African Northeast 73G! 146C! 182T! 195C! 263G! 15301A! 16129A 16148T 16168T 16172C 16187T 16188G 16189C 16223T 16230G 16278T! 16311C 16320T
AF243639 L2 African Northeast 150T! 182T! 16189C 16192T 16223T 16278T 16294T 16309G 16311C!
AF243640 L3 African Northeast 16124C 16223T
AF243641 L3 African Northeast 16185T 16223T 16327T
AF243642 L2 African Northeast 16223T 16264T 16278T 16311C!
AF243643 L2 African Northeast 150T! 182T! 16189C 16223T 16225T 16234T 16278T 16294T 16309G 16311C!
AF243644 L1 African Northeast 15301A! 16129A 16187T 16189C 16214T 16223T 16265C 16278T 16291T 16294T 16311C 16360T
AF243645 L1 African Northeast 15301A! 16129A 16187T 16189C 16223T 16265C 16278T 16286G 16294T 16311C 16360T
AF243646 L2 African Northeast 150T! 182T! 16223T 16278T 16294T 16309G 16311C!
AF243647 L0 African Northeast 73G! 146C! 182T! 195C! 263G! 15301A! 16129A 16148T 16168T 16172C 16187T 16188G 16189C 16223T 16230G 16278T! 16311C 16320T
AF243648 L3 African Northeast 16124C 16223T
AF243649 L3 African Northeast 16172C 16223T 16327T
AF243650 L2 African Northeast 150T! 182T! 16223T 16278T 16294T 16309G 16311C!
AF243651 L3 African Northeast 16129A 16209C 16223T 16292T 16295T 16311C
AF243652 H1 European Northeast 16309G
AF243653 H1 European Northeast 16362C
AF243654 J European Northeast 16069T 16126C
AF243655 HV European Northeast 16234T 16311C 16362C
AF243656 H1 European Northeast 16075C 16189C 16356C
AF243657 K1 European Northeast 16093C 16224C 16311C 16319A
AF243658 H7 European Northeast 16221T
AF243659 K European Northeast 16224C 16311C
AF243660 H2 European Northeast
AF243661 H1 European Northeast 16189C 16356C
AF243662 T2 European Northeast 16126C 16294T 16296T 16304C
AF243663 H2 European Northeast 16189C
AF243664 V7 European Northeast 16153A 16298C
AF243665 H3 European Northeast 16293G
AF243666 L3 African Southeast 750G! 16223T 16265T
AF243667 L2 African Southeast 16111A 16145A 16184T 16223T 16239T 16278T 16292T 16311C 16355T
AF243668 L3 African Southeast 750G! 16223T 16265T
AF243669 L1 African Southeast 15301A! 16086C 16129A 16187T 16189C 16223T 16241G 16274A 16278T 16291T 16293G 16294T 16311C 16360T
AF243670 L3 African Southeast 16185T 16209C 16223T 16327T
AF243671 L2 African Southeast 150T! 195C! 16223T 16224C 16278T 16311C!
AF243672 L2 African Southeast 16223T 16264T 16278T 16311C 16311C
AF243673 L0 African Southeast 73G! 146C! 182T! 185A! 195C! 263G! 15301A! 16129A! 16148T 16172C 16187T 16188G 16189C 16223T 16230G 16278T! 16311C 16320T
AF243674 M5 African Southeast 16223T 16278T 16294T
AF243675 L1 African Southeast 15301A! 16071T 16129A 16145A 16187T 16189C 16213A 16223T 16234T 16265C 16278T 16286G 16294T 16311C 16360T
AF243676 U6 African Southeast 16172C 16189C 16219G 16278T
AF243677 U6 African Southeast 16172C 16189C 16219G 16278T 16362C
AF243678 L4 African Southeast 5460A! 16223T 16293T 16311C 16355T 16362C
AF243679 L2 African Southeast 16114A 16129A 16213A 16223T 16278T 16311C!
AF243680 L1 African Southeast 1438G! 15301A! 16126C 16129A! 16187T 16189C 16223T 16264T 16270T 16278T 16311C
AF243681 L4 African Southeast 5460A! 16223T 16293T 16311C 16355T 16362C
AF243682 X1 African Southeast 16104T 16189C 16223T 16278T!
AF243683 L1 African Southeast 195C! 2283T! 7055G! 15301A! 16104T 16129A! 16163G 16187T 16189C 16223T 16278T 16293G 16294T 16311C 16360T
AF243684 L3 African Southeast 10398G! 16185T 16223T 16327T!
AF243685 L0 African Southeast 73G! 146C! 152C! 182T! 195C! 263G! 15301A! 16093C 16129A 16148T 16168T 16172C 16187T 16188G 16189C 16223T 16230G 16278T 16278T 16293G 16311C 16320T
AF243686 L3 African Southeast 16185T 16223T 16327T
AF243687 L1 African Southeast 15301A! 16129A 16187T 16189C 16223T 16278T 16293G 16294T 16311C 16360T
AF243688 L1 African Southeast 195C! 7055G! 15301A! 16129A 16163G 16187T 16189C 16209C 16223T 16278T 16293G 16294T 16311C 16360T
AF243689 L1 African Southeast 15301A! 16086C! 16129A! 16189C 16223T 16278T 16293G 16294T 16311C 16360T
AF243690 L3 African Southeast 16223T 16320T
AF243691 L3 African Southeast 16172C 16189C 16223T 16320T
AF243692 M5 African Southeast 16223T 16278T 16294T
AF243693 L3 African Southeast 16172C 16189C 16223T 16311C 16320T
AF243694 L1 African Southeast 198T! 10398G! 15301A! 16129A 16187T 16189C 16223T! 16278T 16293G 16294T 16311C 16360T
AF243695 L3 African Southeast 16172C 16189C 16223T 16320T
AF243696 A+ Amerindian/Asian Southeast 16189C 16223T 16290T 16319A 16362C
AF243697 A2 Amerindian/Asian Southeast 152C! 16097C 16098G 16111T 16223T 16290T 16319A 16362C
AF243698 C1 Amerindian/Asian Southeast 16223T 16325C 16327T
AF243699 A2 Amerindian/Asian Southeast 152C! 16111T! 16126C 16223T 16278T 16290T 16319A 16362C
AF243700 A2 Amerindian/Asian Southeast 152C! 16111T 16192T 16223T 16290T 16319A 16362C
AF243701 A8 Amerindian/Asian Southeast 16223T 16242T 16290T 16319A
AF243702 B4 Amerindian/Asian Southeast 16189C 16217C
AF243703 B4 Amerindian/Asian Southeast 16189C 16217C
AF243704 G1 Amerindian/Asian Southeast 16223T 16325C 16362C
AF243705 C1 Amerindian/Asian Southeast 16223T 16298C 16325C 16327T
AF243706 A2 Amerindian/Asian Southeast 152C! 16111T 16223T 16290T 16319A 16362C
AF243707 B4 Amerindian/Asian Southeast 16189C 16217C
AF243708 C1 Amerindian/Asian Southeast 16223T 16298C 16325C 16327T
AF243709 B4 Amerindian/Asian Southeast 16189C 16217C
AF243710 A2 Amerindian/Asian Southeast 152C! 16111T 16189C 16223T 16290T 16319A 16362C
AF243780 H5 European South 16304C
AF243781 H1 European South 16162G
AF243782 U5 European South 16144C 16189C 16192T! 16270T
AF243783 T2 European South 16126C 16153A 16294T 16296T
AF243784 H2 European South
AF243785 J1 European South 16069T 16126C 16261T
AF243786 H2 European South 16124C 16354T
AF243787 H7 European South 16213A
AF243788 T2 European South 16126C 16147T 16294T 16296T 16297C 16304C
AF243789 H3 European South 16093C
AF243790 X2 European South 16189C 16223T 16248T 16278T
AF243791 HV European South 16298C
AF243792 U5 European South 16189C 16192T! 16270T
AF243793 K1 European South 16224C 16311C 16319A
AF243794 U7 European South 16309G 16318T
AF243795 J1 European South 2706G! 16069T 16126C 16222T
AF243796 R0 European South 16126C 16362C


These results suggest that the identity of self-declared white individuals in Brazil is not limited to those of strict European matrilineal ancestry. This is in potential agreement with the ideology of racial democracy proposed by Gilberto Freyre. It also suggests that individuals who self-identify as white, as well as public policies, should consider the genetic risk associated with African variants in this group. At this point it is not possible to establish cause and effect, but neuroblastoma mortality is predominant among self-declared white individuals, either because they use the public health system more often than non-whites (evidence of structural racism) or because the genetic risk variants are also present in self-declared white individuals. We calculate an incidence of 32% for haplogroup L3 in the Northeast region (Table 2).

The L3 haplogroup is not present among self-declared white individuals in the South of Brazil. The observation of mitochondrial haplogroups of African ancestry in self-declared white Brazilians (Figure 7) is consistent with Fridman’s (2014) observation of 35% African matrilineal lineage in self-declared white individuals in Brazil. An estimated 80% European ancestry was calculated for the Y chromosomal lineage of Brazilians. Using the weighted mean proportions technique, Souza et al. (2019) calculated proportions of 68.1%, 19.6% and 11.6% for the European, African and Native American parental ancestries of the Brazilian population. While haplogroups L1 and L3 may confer a greater risk of nervous system diseases on individuals identified as white, haplogroup K, of European origin, presents a reduced risk of neuroblastoma according to Chang et al. (2020). This suggests that Brazilians who self-declare as white and who carry the European mitochondrial haplogroup K are less susceptible to neuroblastoma than individuals of African matrilineal lineage (Figure 6).

4.2 Applying the genomic model to the public health system

4.2.2. Mortality in the Racial Groups

To apply models such as the one described in Chaves et al., 2024 (in preparation) to genomic research in diseases of the nervous system in Brazilians, we propose collaborative genomic research between UCSC and UChicago in the USA and CIDACS at Instituto Gonçalo Moniz in Brazil. To enable the collaboration, we accessed neuroblastoma mortality data from Brazilians using the microdatasus R package. We found a predominance of deaths of self-declared white individuals in the first decade of the 2000s (Figure 3). The prevalence of self-declared white individuals in neuroblastoma mortality data in Brazil can be attributed to the ideology of “whitening” (Pena et al. 2011) in the Brazilian racial identity (Mitchell 2022) and to gender asymmetry in interracial relationships in Brazil (Pena 2007). After 2010, a decrease in the proportion of mortality of self-declared white individuals was observed (Figure 3). In Figure 2, we plot the mortality rate of neuroblastoma in Brazil using the dados_nb_appended_melted object, which was saved in a previous step, and the code below.

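library(ggplot2)
library(magrittr)  # provides the %>% pipe
library(plotly)    # provides ggplotly()
# Load the melted (long-format) neuroblastoma mortality table
nb_mortality_df <- readRDS("../../ReComBio Scientific/geraldo/data/dados_nb_appended_melted.rds")
# Bars scaled to 1 show the share of each race category per year
p_nb <- nb_mortality_df %>% 
  ggplot(aes(x = year, y = value, 
             fill = raca_cor_factor)) + 
  geom_bar(position = "fill", stat = "identity")
# Convert the static ggplot into an interactive plotly figure
ggplotly(p_nb)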

Figure 2: Annual mortality due to neuroblastoma in a sample of the Brazilian population between 2000 and 2015, made available by the SUS Information Technology Department, DATASUS. Mortality is calculated using the R language through the Microdatasus library. Artwork by @allison_horst
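
In the plotting code for Figure 2, position = "fill" rescales the bars so that each year sums to 1, so the figure shows the share of deaths per race category rather than absolute counts. The base R sketch below reproduces that calculation on a toy table of death records. The records are invented; only the column names year, raca_cor_factor and value follow the plotting code, and the real structure of dados_nb_appended_melted is an assumption.

```r
# Toy death records, one row per death (values invented for illustration).
deaths <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001, 2001, 2001),
  raca_cor_factor = c("Branca", "Branca", "Parda",
                      "Branca", "Parda", "Parda", "Preta")
)

# Count deaths per year and race category.
counts <- table(deaths$year, deaths$raca_cor_factor)

# Within-year shares: the quantity that position = "fill" displays.
shares <- prop.table(counts, margin = 1)

# Long ("melted") layout with one row per year and category.
shares_long <- as.data.frame(shares, responseName = "value")
names(shares_long)[1:2] <- c("year", "raca_cor_factor")

print(shares_long)
```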

Since the TET1/5-hmC hypomethylation genomic model proposed in Chaves et al., 2024 can only be applied to Brazilians after sequencing 5-hmC from cfDNA of Brazilian patients, the current minimum viable product (MVP) of the machine learning tool proposed in this project is a proof of concept that analyzes the mitochondrial DNA of Brazilians published by Alves-Silva et al. (2000). The predominance of self-declared white individuals in the neuroblastoma mortality data (Figure 3) is in line with the African ancestry (indicated in purple) of haplogroups L1 and L3 in self-declared white individuals in Brazil and with the higher risk of nervous system diseases associated with these haplogroups (Tranah et al. 2012; Tranah et al. 2014) (Figure 2).

Because we identified genetic variants associated with higher risk of diseases in the nervous system in the Brazilian mitochondrial sequences, we decided to look at the incidence of these mitochondrial haplogroups per large geographic region of Brazil, using the geographic information present in the Alves-Silva et al. (2000) study. To do that we accessed geographic information about the neuroblastoma death rates of the self-identifying racial groups in the geographic regions of Brazil.

We then loaded the neuroblastoma mortality by race for each state:

data_nb_2013_estados_perct <- readRDS("../../ReComBio Scientific/geraldo/data/data_nb_2013_estados_perct.rds")
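
A per-state table such as data_nb_2013_estados_perct could be derived from individual death records along the lines of the sketch below. This is an illustration, not the code used to build the saved object: the toy records are invented, the RACACOR coding (1 = branca, 2 = preta, 3 = amarela, 4 = parda, 5 = indígena) is assumed to follow the DATASUS mortality system convention, and records with missing race are assumed to be excluded from the denominator.

```r
# Toy death records (invented); UF is the state abbreviation.
records <- data.frame(
  UF      = c("SP", "SP", "SP", "SP", "BA", "BA"),
  RACACOR = c(1, 1, 2, NA, 4, 1)
)

# Assumed coding of the RACACOR field.
race_labels <- c("1" = "branca", "2" = "preta", "3" = "amarela",
                 "4" = "parda", "5" = "indigena")
records$raca <- factor(race_labels[as.character(records$RACACOR)],
                       levels = unname(race_labels))

# Deaths per state and race; table() drops missing race by default.
counts <- table(records$UF, records$raca)

# Share of each race category among the deaths recorded in each state.
perct <- prop.table(counts, margin = 1)

state_perct <- data.frame(
  UF           = rownames(perct),
  branca_perc  = as.numeric(perct[, "branca"]),
  preta_perc   = as.numeric(perct[, "preta"]),
  amarela_perc = as.numeric(perct[, "amarela"])
)

print(state_perct)
```

Expressing the shares as fractions between 0 and 1 matches the fill limits used in the map code of this section.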

Then we loaded geographic (state) data using the read_state function from the geobr R package as follows:

library(geobr)

# read all states
states <- read_state(
  year = 2019,
  showProgress = FALSE
)

To integrate genomic and geographic information, we joined the states (from geobr) dataframe and the data_estados_brancos_perct (total mortality from Microdatasus) dataframe:

states_perct_brancos <- dplyr::left_join(states, data_estados_brancos_perct, 
                                         by = c("abbrev_state" = "UF"))


We likewise joined the states dataframe with data_nb_2013_estados_perct (neuroblastoma mortality from Microdatasus) to create states_nb_2013:

states_nb_2013 <- dplyr::left_join(states, data_nb_2013_estados_perct, 
                                         by = c("abbrev_state" = "UF"))
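
A left join keeps every state, so states without mortality records end up with missing values and are drawn without a fill color on a map. The sketch below shows the same join logic on toy data frames using base R merge, followed by a check for unmatched states. The toy values are invented, and plain data frames stand in for the sf object returned by geobr.

```r
# Toy stand-ins for the geobr states table and the mortality table.
states_toy <- data.frame(
  abbrev_state = c("BA", "SP", "TO"),
  region       = c("Northeast", "Southeast", "North")
)
mortality_toy <- data.frame(
  UF          = c("SP", "BA"),
  branca_perc = c(0.6, 0.4)     # invented shares
)

# Equivalent of dplyr::left_join(states, mortality, by = c("abbrev_state" = "UF")).
joined <- merge(states_toy, mortality_toy,
                by.x = "abbrev_state", by.y = "UF", all.x = TRUE)

# States that found no match in the mortality table.
unmatched <- joined$abbrev_state[is.na(joined$branca_perc)]

print(joined)
print(unmatched)
```

Inspecting the unmatched states before plotting helps separate true absence of deaths from mismatched state codes.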


Because the main ethnic-racial group affected by neuroblastoma mortality in Brazil was the self-reported white group, we began exploring the spatial information in the demographic data stored by DATASUS by visualizing the mortality rate of self-declared white individuals in the states of Brazil (data not shown). We obtained the number of self-declared white individuals in each state. With that number, we can estimate the number of self-declared white individuals reported as deceased by the SUS health system in 2014. As expected, proportionally more self-reported white individuals died in Southern Brazil than in any other of the large geographic regions (data not shown).

Then we plotted the mortality rate of neuroblastoma among self-declared white individuals in Brazil:

ggplot() +
  geom_sf(data=states_nb_2013, aes(fill=branca_perc), 
          color= "black", size=.15) + ## Color here is the line of the border
    labs(subtitle="", size=8) +
    scale_fill_distiller(palette = "Reds", name="Mortality\nRate", direction=+1, 
                         limits = c(0,1)) +
    theme_minimal() #+no_axis

Figure 3: Mortality rates of self-declared white individuals in the states of the large regions of Brazil. These should be contrasted with the neuroblastoma and neurodegeneration mortality rates in the same states to evaluate whether mortality rates are equal for all ethnic-racial groups.


The mortality rate of neuroblastoma among self-declared black individuals in Brazil:

ggplot() +
  geom_sf(data=states_nb_2013, aes(fill=preta_perc), 
          color= "black", size=.15) + ## Color here is the line of the border
    labs(subtitle="", size=8) +
    scale_fill_distiller(palette = "Reds", name="Mortality\nRate", direction=+1, 
                         limits = c(0,0.3)) +
    theme_minimal() #+no_axis

Figure 4: Neuroblastoma mortality rates of self-declared black individuals in the states of the large regions of Brazil. These should be contrasted with the rates of the other groups to evaluate whether mortality rates are equal for all ethnic-racial groups.


And the mortality rate of neuroblastoma among self-declared yellow (amarela) individuals in Brazil:

ggplot() +
  geom_sf(data=states_nb_2013, aes(fill=amarela_perc), 
          color= "black", size=.15) + ## Color here is the line of the border
    labs(subtitle="", size=8) +
    scale_fill_distiller(palette = "Reds", name="Mortality\nRate", direction=+1, 
                         limits = c(0,1)) +
    theme_minimal() #+no_axis

Figure 5: Neuroblastoma mortality rates of self-declared yellow (amarela) individuals in the states of the large regions of Brazil. These should be contrasted with the rates of the other groups to evaluate whether mortality rates are equal for all ethnic-racial groups.


Having established where the mortality rate of self-reported white individuals is highest, we begin assessing the incidence of the mitochondrial haplogroups that confer protection against, or risk of, diseases of the nervous system. Table 3 shows the incidence of each haplogroup identified in the mitochondrial sequences isolated from self-declared white Brazilians. The Region column in that table can be used to merge the genotype information with the demographic information contained in the following table, extracted using the Microdatasus R package.

dados_estados_nb_2013 <- readRDS("../../R/R journal/data/dados_estados_nb_2013.rds")
dim(dados_estados_nb_2013)
[1] 302   8
# kable(dados_estados_nb_2013, caption="dados_estados_nb_2013") %>%
#   kable_styling("striped", full_width = F, font_size = 12) %>%
#   scroll_box(width = "100%", height = "600px")

head(dados_estados_nb_2013)
      CONTADOR RACACOR  DTOBITO CAUSABAS   DTNASC IDADE UF    Region
6222      6222       1 23022013     C749 30122003   409 TO     North
31026    24155       1 16022013     C749 09022012   401 SP Southeast
44075    37204    <NA> 09042013     C749 25102009   403 SP Southeast
44097    37226       2 08082013     C749 03042012   401 SP Southeast
46745    39874       1 10012013     C749 22031953   459 SP Southeast
48051    41180       1 19012013     C749 02032011   401 SP Southeast
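
The DTOBITO and DTNASC fields in this output are dates written as ddmmyyyy, and IDADE appears to combine a unit digit with a two-digit value. The sketch below parses three of the rows shown above. The reading of IDADE (first digit 4 meaning years, last two digits the age) is an assumption that happens to agree with the dates in these rows; other unit codes are left as missing.

```r
# Three rows copied from the output above.
sim <- data.frame(
  DTOBITO = c("23022013", "16022013", "10012013"),
  DTNASC  = c("30122003", "09022012", "22031953"),
  IDADE   = c("409", "401", "459"),
  stringsAsFactors = FALSE
)

# Parse the ddmmyyyy strings into Date objects.
sim$death_date <- as.Date(sim$DTOBITO, format = "%d%m%Y")
sim$birth_date <- as.Date(sim$DTNASC,  format = "%d%m%Y")

# Assumed encoding: unit digit "4" = years, remaining digits = age.
sim$age_unit  <- substr(sim$IDADE, 1, 1)
sim$age_years <- ifelse(sim$age_unit == "4",
                        as.integer(substr(sim$IDADE, 2, 3)),
                        NA_integer_)

# Cross-check against the age implied by the two dates.
sim$age_from_dates <- floor(as.numeric(sim$death_date - sim$birth_date) / 365.25)

print(sim[, c("birth_date", "death_date", "age_years", "age_from_dates")])
```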


Note that both Table 3 and dados_estados_nb_2013 have a column named Region, depicting the large regions of Brazil. Similar columns can be used to store information about the state, municipality, city and health unit that serves the patient in the public health system.
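
The shared Region column makes it possible to combine genotype and mortality information with a single merge. The sketch below illustrates this with toy data: the haplogroup frequencies and death records are invented, and the column names L3_freq and n_deaths are hypothetical.

```r
# Toy haplogroup frequencies per region (invented values).
haplo_by_region <- data.frame(
  Region  = c("North", "Northeast", "Southeast", "South"),
  L3_freq = c(0.10, 0.20, 0.15, 0.05)
)

# Toy death records with the same Region column as dados_estados_nb_2013.
deaths <- data.frame(
  UF     = c("TO", "SP", "SP", "SP", "BA"),
  Region = c("North", "Southeast", "Southeast", "Southeast", "Northeast")
)

# Number of deaths per region.
deaths_by_region <- as.data.frame(table(Region = deaths$Region),
                                  responseName = "n_deaths")
deaths_by_region$Region <- as.character(deaths_by_region$Region)

# Keep every region of the genotype table, even those without deaths.
merged <- merge(haplo_by_region, deaths_by_region, by = "Region", all.x = TRUE)
merged$n_deaths[is.na(merged$n_deaths)] <- 0

print(merged)
```

The same pattern extends to finer keys such as state or municipality, provided both tables use the same codes.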


4.3 Estimation of Genetic Risk

After estimating the mortality of individuals who identify as white per region in Brazil (Figure 4), we begin estimating the incidence of the mitochondrial haplogroups by geographic region. Toparslan et al. (2020) proposed an R workflow for phylogenetic analysis and visualization of mitochondrial sequences. Chang et al. (2020), Tranah et al. (2012), Tranah et al. (2014), Feder et al. (2008) and Kofler et al. (2009) have reported that these haplogroups are associated with the genetic risk of diseases of the nervous system. We now estimate which mitochondrial lineages are exposed to the highest risks across the territory of Brazil.
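
One simple way to estimate haplogroup incidence by region is to tabulate the haplogroup assigned to each sequenced sample against the region where it was collected. The sketch below does this on invented samples; it is an illustration under that assumption and not the GERALDA implementation.

```r
# Toy sample table: one row per sequenced individual (invented values).
samples <- data.frame(
  Region = c("South", "South", "South",
             "Northeast", "Northeast",
             "Southeast", "Southeast", "Southeast", "Southeast", "Southeast"),
  haplogroup = c("J", "K", "J",
                 "K", "L3",
                 "L3", "L3", "L3", "T", "J")
)

# Haplogroup counts per region.
counts <- table(samples$Region, samples$haplogroup)

# Relative frequency of each haplogroup within each region.
freq <- prop.table(counts, margin = 1)

# Region where each haplogroup is most frequent.
top_region <- apply(freq, 2, function(x) names(which.max(x)))

print(round(freq, 2))
print(top_region)
```

With real data, small regional sample sizes make these frequencies noisy, so confidence intervals would be worth reporting alongside them.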

Haplogroup J

This haplogroup is predominant in Southern Brazil, with a significant presence in the Northeast. Haplogroup J was reported by Tranah et al. (2012) to be associated with cognitive impairment. Feder et al. (2008) found that this haplogroup is also associated with type 2 diabetes in Ashkenazi Jews, and we identified (Chaves et al. 2019) a genetic mechanism that causes this disease to be more frequent in people who have neurodegenerative diseases.

Figure 6: Estimation of incidence of mitochondrial haplogroup J in populations of the large regions of Brazil.


Haplogroup K

In this study, this haplogroup was found to be predominant in Southern and Northeastern Brazil. Chang et al. (2020) reported that haplogroup K protects against neuroblastoma, including high-risk neuroblastoma, the most aggressive form of the disease. It is possible that this haplogroup is associated with increased inflammatory response and T-cell infiltration in hot neuroblastoma tumors via mitochondrial reprogramming of metabolism in cancer and immune cells. Of note, we identified the highest incidence of haplogroup K in the Brazilian population of the Northeast, the region nationally known for the arrival of the Portuguese in 1500 in Porto Seguro, Bahia. Considering the recent Italian immigration to the South and Southeast, which accounts for a considerable share of European immigration to Brazil, it is unexpected that this protective haplogroup is so highly present in the Northeast region. One possible explanation for this observation is the immigration of people from the Netherlands to the state of Pernambuco during the period also known as the Dutch invasions of Brazil.

Figure 7: Estimation of incidence of mitochondrial haplogroup K in populations of the large regions of Brazil.
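
A protective association such as the one reported for haplogroup K is typically quantified with an odds ratio from a case-control table. The sketch below computes one with base R on invented counts. The numbers are not taken from Chang et al. (2020), and Fisher's exact test is used here only as a convenient example of a significance test.

```r
# Toy 2x2 case-control table (counts invented for illustration).
tab <- matrix(c(10, 90,
                30, 70),
              nrow = 2, byrow = TRUE,
              dimnames = list(group      = c("case", "control"),
                              haplogroup = c("K", "other")))

# Sample odds ratio: odds of carrying K among cases vs. among controls.
odds_ratio <- (tab["case", "K"] * tab["control", "other"]) /
              (tab["case", "other"] * tab["control", "K"])

# Fisher's exact test reports a conditional estimate of the odds ratio,
# a confidence interval and a p-value.
ft <- fisher.test(tab)

print(odds_ratio)
print(ft$conf.int)
print(ft$p.value)
```

An odds ratio below 1 with a confidence interval that excludes 1 would be read as a protective association in this toy setting.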


Haplogroup L3

This haplogroup is predominant in Southeastern and Northeastern Brazil. Haplogroup L3 was reported by Tranah et al. (2014) to be associated with cognitive impairment in African Americans. It is possible that this haplogroup is associated with increased mortality by neuroblastoma in self-declared white Brazilians of mitochondrial African ancestry.

Figure 8: Estimation of incidence of mitochondrial haplogroup L3 in populations of the large regions of Brazil.


Haplogroup T

Haplogroup T was detected in samples from Northeastern and Southern Brazil. According to a study by Kofler et al. (2009), mitochondrial DNA haplogroup T is associated with coronary artery disease and diabetic retinopathy. Chang et al. (2020) also reported an association between haplogroup T and neuroblastoma.

Figure 9: Estimation of incidence of mitochondrial haplogroup T in populations of the large regions of Brazil.


5 Conclusions

References

X. Chang, M. Bakay, Y. Liu, J. Glessner, K. S. Rathi, J. M. Maris and H. Hakonarson. Mitochondrial DNA haplogroups and susceptibility to neuroblastoma. 2020. JNCI J Natl Cancer Inst (2020) 112(12): djaa024.
G. Chaves, J. Stanley and N. Pourmand. Mutant huntingtin affects diabetes and Alzheimer’s markers in human and cell models of Huntington’s disease. 2019. Cells 2019, 8, 962; doi:10.3390/cells8090962.
D. Chor and C. R. de Araujo Lima. Aspectos epidemiológicos das desigualdades raciais em saúde no Brasil. 2005. Cad. Saúde Pública, Rio de Janeiro, 21(5):1586-1594, set-out, 2005.
J. Feder, I. Blech, O. Ovadia, S. Amar, J. Wainstein, I. Raz and D. Mishmar. Differences in mtDNA haplogroup distribution among 3 Jewish populations alter susceptibility to T2DM complications. 2008. BMC Genomics 2008, 9:198 doi:10.1186/1471-2164-9-198.
R. de Freitas Saldanha, R. R. Bastos and C. Barcellos. Microdatasus: A package for downloading and preprocessing microdata from Brazilian Health Informatics Department (DATASUS). 2019. doi:10.1590/0102-311X00032419.
M. Garg, M. Karpinski, D. Matelska, L. Middleton, O. S. Burren, F. Hu, E. Wheeler, K. R. Smith and D. Vitsios. Disease prediction with multi-omics and biomarkers empowers case–control genetic discoveries in the UK Biobank. 2024. https://doi.org/10.1038/s41588-024-01898-1.
T. van Groningen, J. Koster, L. J. Valentijn, D. A. Zwijnenburg, B. A. Westerman, J. van Nes and R. Versteeg. Neuroblastoma is composed of two super-enhancer-associated differentiation states. 2017. https://doi.org/10.1038/ng.3899.
B. Kofler, E. E. Mueller, W. Eder, O. Stanger, R. Maier, M. Weger, F. A. Zimmermann, J. A. Mayr and W. Sperl. Mitochondrial DNA haplogroup T is associated with coronary artery disease and diabetic retinopathy: A case control study. 2009. BMC Medical Genetics 2009, 10:35 doi:10.1186/1471-2350-10-35.
R. H. M. Pereira and C. N. Goncalves. Geobr: Download official spatial data sets of Brazil. 2024. URL https://ipeagit.github.io/geobr/. R package version 1.9.1, https://github.com/ipeaGIT/geobr.
M. Rambaran-Olm and E. Wade. The many myths of the term “Anglo-Saxon.” 2021.
S. Schönherr, H. Weissensteiner, F. Kronenberg and L. Forer. Haplogrep 3 - an interactive haplogroup classification and analysis platform. 2023. https://doi.org/10.1093/nar/gkad284.
G. M. Silva, V. T. Daflon and C. Giraut. Seeing race like a state: Higher education affirmative action verification commissions in Brazil. 2023. https://doi.org/10.1017/lap.2023.18.
A. M. de Souza, S. S. Resende, T. N. de Sousa and C. F. A. de Brito. A systematic scoping review of the genetic ancestry of the Brazilian population. 2019. Genetics and Molecular Biology, 42, 3, 495-508 (2019).
E. Toparslan, K. Karabag and U. Bilge. A workflow with R: Phylogenetic analyses and visualizations using mitochondrial cytochrome b gene sequences. 2020. https://doi.org/10.1371/journal.pone.0243927.
G. J. Tranah, M. A. Nalls, S. M. Katzman, J. S. Yokoyama, E. T. Lam, Y. Zhao and S. Mooney. Mitochondrial DNA sequence variation associated with dementia and cognitive function in the elderly. 2012. J Alzheimers Dis. 2012 ; 32(2): 357–372. doi:10.3233/JAD-2012-120466.
G. J. Tranah, J. S. Yokoyama, S. M. Katzman, M. A. Nalls, A. B. Newman and T. B. Harris. Mitochondrial DNA sequence associations with dementia and amyloid-β in elderly African Americans. 2014. Neurobiology of Aging 35 (2014) 442.e1–442.e8.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Chaves, et al., "GERALDA: A framework for integration of genomic information into the database of the Brazilian Health System", The R Journal, 2024

BibTeX citation

@article{quokka-bilby,
  author = {Chaves, Gepoliano and Ramos, Pablo Ivan and Gonçalves, Marilda},
  title = {GERALDA: A framework for integration of genomic information into the database of the Brazilian Health System},
  journal = {The R Journal},
  year = {2024},
  issn = {2073-4859},
  pages = {1}
}